Does the smelliness of your city impact the kind of perfume you wear?

Authors

Olivia Li and Riya Sinha

Introduction

We explore whether the smelliness of a US city impacts the kind of perfume its residents wear. This project integrates perfume sales data from eBay with air quality metrics to analyze purchasing patterns across US cities. We have broken this topic into two separate goals:

  1. Determine whether there is a noticeable difference in sales between perfume longevity among cities with higher/lower air pollution
  2. Link perfume notes to cities with higher or lower air pollution

Data Sources:

We will be looking at three datasets found on Kaggle:

  1. A dataset of global perfume brands, notes, types, and longevity (Perfume Dataset)

    • “This dataset was synthetically generated and curated with the assistance of AI (ChatGPT, OpenAI GPT-5). It was inspired by public fragrance resources such as Fragrantica, Basenotes, brand catalogs, and community reviews but does not reproduce copyrighted or proprietary content. It is designed for educational and research purposes only, providing a structured reference for fragrance analysis, experimentation, and machine learning projects”

  2. A dataset of perfume listings on eBay shown with items sold and where they are located (estimating the most popular perfumes in a city) (e-Commerce Dataset)

    • “This dataset has been ethically mined from eBay using the Apify API, ensuring compliance with ethical data collection practices.”

  3. A dataset of AQI (Air Quality Index) by city to determine their “smelliness” (Global Pollution Dataset)

    • “These datas are collected by using web scraping and slightly preprocessed by using hand engineering. These datas are collected from elichens.com”

Issues with the Data Sources

As the data sources were all CSV files, there were no issues importing and reading them into RStudio or Power Query.

Transformations:

This analysis will require us to perform multiple joins to get the final table that includes the perfume names, brands, units sold, location/city sold to, and city AQI level. Additionally, we will need to utilize different techniques to clean and run diagnostics on the data.

We began by filtering the Global Pollution data set to narrow down the US Cities and their corresponding AQI Values. We then concatenated men and women’s eBay Perfume eCommerce data. In order to match this with the US Cities’ pollution data, we needed to split the itemLocation by its delimiter and parse out the irrelevant, duplicate, or redundant location names, of which there were many.

To join the unstructured eBay perfume titles (e.g., “1 Million by Paco Rabanne 3.4 Fl oz / 100 ml PARFUM Spray Men’s New & Sealed”) with the structured perfume attribute data (name, brand, type, longevity, category), we utilized AI to create an intermediary matching table with a 0.6 confidence threshold. The final dataset merged these three sources (sales, pollution, and perfume characteristics) through left joins, followed by removal of null values, to create a clean dataset ready for analyzing how perfume preferences correlate with air quality levels across different US cities.

Code

Setting up and importing packages

library(DBI)
library(duckdb)
library(dbplyr)
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.1     ✔ stringr   1.5.2
✔ ggplot2   4.0.0     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.1.0     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::ident()  masks dbplyr::ident()
✖ dplyr::lag()    masks stats::lag()
✖ dplyr::sql()    masks dbplyr::sql()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(reticulate)
options(duckdb.enable_rstudio_connection_pane=TRUE)
py <- import_main()
#py_install("pandas", pip = TRUE)
#py_module_available("pandas")
# create / connect to in memory database (not stored in a file)
drv <- duckdb()
con <- dbConnect(drv)

Importing Data Sets

import pandas as pd
from io import StringIO

#pollution dataset
py_global_pollution = pd.read_csv('~/Desktop/Data Wrangling/data files/global air pollution dataset.csv')

#ecommerce datasets
py_ebay_women = pd.read_csv('~/Desktop/Data Wrangling/data files/ebay_womens_perfume.csv')
py_ebay_men = pd.read_csv('~/Desktop/Data Wrangling/data files/ebay_mens_perfume.csv')

#perfume notes and brands
py_perfumes = pd.read_csv('~/Desktop/Data Wrangling/data files/Perfumes_dataset.csv')

#intermediary table
py_intermediary = pd.read_csv('~/Desktop/Data Wrangling/data files/perfume_matching_table.csv')
CREATE OR REPLACE TABLE ebay_women AS
  SELECT *
  FROM read_csv('~/Desktop/Data Wrangling/data files/ebay_womens_perfume.csv')
CREATE OR REPLACE TABLE ebay_men AS
  SELECT *
  FROM read_csv('~/Desktop/Data Wrangling/data files/ebay_mens_perfume.csv')
CREATE OR REPLACE TABLE perfumes AS
  SELECT *
  FROM read_csv('~/Desktop/Data Wrangling/data files/Perfumes_dataset.csv')
CREATE OR REPLACE TABLE intermediary AS
  SELECT *
  FROM read_csv('~/Desktop/Data Wrangling/data files/perfume_matching_table.csv')
CREATE OR REPLACE TABLE global_pollution AS
  SELECT *
  FROM read_csv('~/Desktop/Data Wrangling/data files/global air pollution dataset.csv')
r_perfumes <- read_csv("~/Desktop/Data Wrangling/data files/Perfumes_dataset.csv", show_col_types = FALSE)
r_ebay_mens <- read_csv("~/Desktop/Data Wrangling/data files/ebay_mens_perfume.csv", show_col_types = FALSE)
r_ebay_womens <- read_csv("~/Desktop/Data Wrangling/data files/ebay_womens_perfume.csv", show_col_types = FALSE)
r_global_pollution <- read_csv("~/Desktop/Data Wrangling/data files/global air pollution dataset.csv", show_col_types = FALSE)
r_intermediary <- read_csv("~/Desktop/Data Wrangling/data files/perfume_matching_table.csv", show_col_types = FALSE)

For Power Query in Excel, we imported our four datasets using “Text/CSV” to choose the data source.

We then see the following queries

Diagnostics

Create a dataframe for global_pollution to check whether there are duplicate city names in the global data set

py_global_pollution_diag = (py_global_pollution
                                .groupby(['City', 'Country'])
                                .agg(dupe_count=('City', 'size'))    #count rows of duplicates after group by
                                .sort_values('dupe_count', ascending=False)
                                .reset_index()
                                )
py_global_pollution_diag.shape
(23035, 3)
py_global_pollution.shape
(23463, 12)
  SELECT Country, 
         COUNT(DISTINCT City) AS num_cities
  FROM global_pollution
  GROUP BY Country
  HAVING Country = 'United States of America'
-- 427 cities were identified with no corresponding country
-- 2872 distinct cities in the US
1 records
Country num_cities
United States of America 2872
r_global_pollution |>
  distinct(.keep_all = TRUE) |>
  count(City) |>
  filter(n>1)
# 0 rows means there are no duplicates

We ran diagnostics by first grouping by City to find the number of times each city appears in the global pollution dataset. We then filtered to see if any City appears more than once, and finally counted the total number of cities.

Narrowing down global pollution to US cities

py_us_pollution = (py_global_pollution
                  .query("Country == 'United States of America'"))
py_us_pollution.loc[:,'City'] = py_us_pollution['City'].str.lower()
py_us_pollution.shape
(2872, 12)
py_us_pollution.info()      #checking that no null values were included
<class 'pandas.core.frame.DataFrame'>
Index: 2872 entries, 5 to 23461
Data columns (total 12 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   Country             2872 non-null   object
 1   City                2872 non-null   object
 2   AQI Value           2872 non-null   int64 
 3   AQI Category        2872 non-null   object
 4   CO AQI Value        2872 non-null   int64 
 5   CO AQI Category     2872 non-null   object
 6   Ozone AQI Value     2872 non-null   int64 
 7   Ozone AQI Category  2872 non-null   object
 8   NO2 AQI Value       2872 non-null   int64 
 9   NO2 AQI Category    2872 non-null   object
 10  PM2.5 AQI Value     2872 non-null   int64 
 11  PM2.5 AQI Category  2872 non-null   object
dtypes: int64(5), object(7)
memory usage: 291.7+ KB
CREATE OR REPLACE TABLE sql_us_pollution AS
  SELECT 
        Country,
        LOWER(City) AS City,
        "AQI Value",
        "AQI Category",
        "CO AQI Value",
        "CO AQI Category",
        "Ozone AQI Value",
        "Ozone AQI Category",
        "NO2 AQI Value",
        "NO2 AQI Category",
        "PM2.5 AQI Value",
        "PM2.5 AQI Category"
  FROM global_pollution
  WHERE Country = 'United States of America'

Ensuring that there are no duplicate cities within the US pollution dataset: the set has 2872 distinct cities, which matches the 2872 row count.

SELECT 
        COUNT(DISTINCT City) AS city_num, 
        COUNT(*) AS row_num
FROM sql_us_pollution
1 records
city_num row_num
2872 2872
r_us_pollution <- r_global_pollution |>
  filter(Country == "United States of America") |>
  mutate(City= tolower(City))
nrow(r_us_pollution)
[1] 2872

Filtered Country column in pq_global_pollution to only include “United States of America,” and then renamed the query to pq_us_pollution.

We transformed all City values to lowercase for consistency and checked that no null values were included.

No null values were included.

Merge with eBay datasets (men and women), and unnest itemLocation

Stacking men and women data sets on top of each other

py_ebay= pd.concat([py_ebay_women, py_ebay_men])
py_long_ebay= (py_ebay
      .assign(City = lambda py_ebay_df_: py_ebay_df_['itemLocation']
                    .str.split(',')                                     
                    .apply(lambda x: [item.strip() for item in x]))                   #removing leading and trailing white space
      .drop(columns = ['itemLocation'])
      .explode('City')
      .filter(items = [
                      'title',
                      'sold',
                      'City'
                      ])
)
#making City and title entries all lowercase before joining
py_long_ebay[['City', 'title']] = py_long_ebay[['City', 'title']].apply(lambda x: x.str.lower())         
py_long_ebay.shape
(5897, 3)
CREATE OR REPLACE TABLE sql_long_ebay AS
  SELECT LOWER(title) AS title,
         sold,
         LOWER(                                      --standardizing case on itemLocation to match python
          TRIM(                                      --trimming white space
           UNNEST(
            STRING_SPLIT(
              REPLACE(itemLocation, ', ', ','),      --standardizing delimiters from ", " to ","
            ',')))) 
         AS City                                     --Renaming this column so that we can use USING later in the join
                                                     --because I don't like having duplicate columns
  FROM ebay_women
  UNION ALL
  SELECT LOWER(title) AS title,
         sold,
         LOWER(   
          TRIM( 
           UNNEST(
            STRING_SPLIT(
              REPLACE(itemLocation, ', ', ','),        --standardizing delimiters from ", " to ","
            ',')))) 
        AS City
  FROM ebay_men
SELECT COUNT(*) AS num_rows
FROM sql_long_ebay
1 records
num_rows
5897
r_long_ebay <- 
  bind_rows (r_ebay_mens, r_ebay_womens) |>                               #add men and women ebay datasets together on top of each other
  mutate(itemLocation = str_replace_all(itemLocation, ", ", ",")) |>      #standardize the delimiters to "," before separating
  separate_longer_delim(itemLocation, delim = ",") |>                     #unnesting itemLocation and make it longer so 
                                                                          #then we can join on City and itemLocation 
                                                                          #this also trims white space 
  select(title, sold, itemLocation) |>
  mutate(itemLocation = tolower(itemLocation),
         title = tolower(title))

nrow(r_long_ebay)                                                        
[1] 5897

Appended pq_ebay_mens to pq_ebay_womens.

Filtered out columns to only include title, sold, itemLocation

standardized delimiters to ‘,’

Unnesting itemLocation to make it longer

Filtering out perfumes that were not sold. Changing itemLocation and title to lowercase for consistency.

Join us_pollution and clean_ebay on cities

Some of the itemLocation values from r_ebay included places outside the US, like Hong Kong, and the order/formatting was not always consistent (e.g., “New Jersey, Hong Kong”; “Orange, New Jersey, USA”; “USA, New Jersey, Hong Kong”). We did an inner join with us_pollution to cut out any states and countries outside of the US.

Working in R, we were confused since the SQL dataset initially returned a different number of rows here compared to the R output. Did we have a duplicate or something? We then checked that there were no duplicate titles.

This confused us even more: how could the number of distinct titles be so much less than our output? Our hypothesis was that some titles were duplicated when we made the table longer and joined on City after delimiting itemLocation. For example, “New York, New York, USA” was split into three rows with locations “new york,” “new york,” and “usa.” We had inner joined the pollution and eBay data on City = itemLocation, hoping that only the city names would match and be kept. However, this made no distinction between New York City and New York (state), so the titles for those entries were duplicated.
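The duplication can be reproduced in a few lines of plain Python. The listing and city set below are hypothetical stand-ins for the eBay data and the us_pollution City column, not the actual tables:

```python
# Hypothetical minimal reproduction of the duplicate-title issue
listing = {"title": "example perfume", "itemLocation": "New York, New York, USA"}
us_cities = {"new york"}  # stand-in for the City column of us_pollution

# Explode itemLocation the same way the pipeline does: split, trim, lowercase
parts = [p.strip().lower() for p in listing["itemLocation"].split(",")]

# Keep only the parts that match a US city (what the inner join does)
matches = [p for p in parts if p in us_cities]
print(matches)  # ['new york', 'new york'] -> the title survives the join twice
```

Because both the city and the state collapse to the same lowercase string, the title matches twice and is duplicated after the join.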

Then we needed to get rid of the duplicates and only keep the unique titles.

We were able to refine the output to 1451 perfumes with no duplicates. We then ran a quick row-count check across all three languages to ensure nothing was wrong.

py_perfume_sales_by_city= (py_us_pollution
                                .merge(py_long_ebay,
                                        how = 'inner',
                                        on = 'City')
                                .drop(columns = ['Country'])            #dropping Country since these are all US cities
                                                                        #using py_perfume_sales_by_city.info() we found
                                                                        #there are some null values under "sold"
                                .dropna()                               #getting rid of null values
                                .sort_values(['title', 'City', 'sold']) #sort to make dropping duplicates command deterministic 
                                                                        #across languages
                                .drop_duplicates(subset='title')        #getting rid of duplicate perfume titles
                                .reset_index(drop=True)
                                )
py_perfume_sales_by_city.shape
(1451, 13)
py_perfume_sales_by_city.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1451 entries, 0 to 1450
Data columns (total 13 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   City                1451 non-null   object 
 1   AQI Value           1451 non-null   int64  
 2   AQI Category        1451 non-null   object 
 3   CO AQI Value        1451 non-null   int64  
 4   CO AQI Category     1451 non-null   object 
 5   Ozone AQI Value     1451 non-null   int64  
 6   Ozone AQI Category  1451 non-null   object 
 7   NO2 AQI Value       1451 non-null   int64  
 8   NO2 AQI Category    1451 non-null   object 
 9   PM2.5 AQI Value     1451 non-null   int64  
 10  PM2.5 AQI Category  1451 non-null   object 
 11  title               1451 non-null   object 
 12  sold                1451 non-null   float64
dtypes: float64(1), int64(5), object(7)
memory usage: 147.5+ KB
CREATE OR REPLACE TABLE sql_perfume_sales_by_city AS
  SELECT DISTINCT ON (title) *
  FROM sql_long_ebay
    INNER JOIN sql_us_pollution USING (City)
  WHERE sold IS NOT NULL
  ORDER BY title, City, sold              --Sorting to make deterministic for discrepancies between languages
SELECT COUNT(*)
FROM sql_perfume_sales_by_city
1 records
count_star()
1451
r_perfume_sales_by_city <- 
  inner_join(r_us_pollution, r_long_ebay, by = join_by(City == itemLocation)) |>
  drop_na() |>
  arrange(title, City, sold) |>                                              #need to sort by title to keep results deterministic with python
  distinct(title, .keep_all = TRUE)
nrow(r_perfume_sales_by_city)
[1] 1451

inner join on itemLocation and City

expanding the table

sorting table by title to keep it consistent with python

removing duplicate titles

verifying final number of rows

Join perfume_sales_by_city with perfumes data set

Now we need to join perfume_sales_by_city with the perfumes dataset (covering perfume notes and intensities) on perfume title. Because of the extreme variety in the eBay title listings, we would not be able to join the two tables directly. We utilized Claude AI to create an intermediary table so that we could match the two columns we would be joining on. The “perfume” column would be the controlled text, whereas “title” would be the uncontrolled free text where there may not be any matches. After reviewing the results in random samples, we settled on a final 0.6 confidence threshold.
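The AI-generated matching table itself is not reproducible here, but the same idea can be sketched with a simple string-similarity score. The perfume names, titles, and difflib-based confidences below are illustrative stand-ins, not the actual matching method:

```python
from difflib import SequenceMatcher

# Illustrative stand-ins: the real table was built by Claude AI, not difflib
perfumes = ["polo eau de toilette", "1 million lucky"]
titles = ["polo eau de toilette 4.2 oz spray", "vintage bottle lot"]

def best_match(title, candidates):
    """Score a free-text title against each controlled perfume name."""
    scored = [(p, SequenceMatcher(None, title, p).ratio()) for p in candidates]
    return max(scored, key=lambda pair: pair[1])

THRESHOLD = 0.6  # same cutoff we settled on after sampling matches
kept = []
for t in titles:
    perfume, confidence = best_match(t, perfumes)
    if confidence >= THRESHOLD:
        kept.append({"ebay_title": t, "perfume": perfume,
                     "match_confidence": round(confidence, 2)})
# Only the first title clears the threshold; "vintage bottle lot" is dropped
```

The threshold trades recall for precision: lowering it would match more titles but admit more false pairings into the join.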

#fixing case to be consistent across tables
py_intermediary[['ebay_title', 'perfume']] = py_intermediary[['ebay_title', 'perfume']].apply(lambda x: x.str.lower())
py_perfumes['perfume'] = py_perfumes['perfume'].str.lower()
py_perfume_sales_by_city['title'] = py_perfume_sales_by_city['title'].str.lower()

#left join with py_perfume_sales_by_city and py_intermediary on title

py_final = (py_intermediary
                                .rename(columns = {'ebay_title': 'title'})                      #renaming to match column between tables
                                .merge(py_perfume_sales_by_city,
                                        how = 'left',
                                        on = 'title')
                                .merge(py_perfumes,                                             #merge perfumes
                                        how = 'left',
                                        on = 'perfume')
                                .dropna()                                                      #a lot of null values where things weren't 
                                                                                               #sold/located
                                .sort_values(['title', 'City', 'sold'])
                                .drop_duplicates(subset=['title'])
                                .reset_index()
                                .filter(items = ['perfume', 'City', 'AQI Value', 'AQI Category', 'sold', 'brand', 'category', 'longevity'])
                                .replace({'longevity': 
                                {"6–8 hours": "Strong"}
                                })
                                )
print(py_final)
                       perfume        City  ...          category longevity
0         polo eau de toilette    brooklyn  ...    Woody Aromatic    Medium
1         polo eau de toilette    brooklyn  ...    Woody Aromatic    Medium
2    1 million eau de toilette      warren  ...       Woody Spicy    Medium
3    1 million eau de toilette  hackensack  ...       Woody Spicy    Medium
4              1 million lucky  hackensack  ...      Woody Fruity    Strong
..                         ...         ...  ...               ...       ...
429              mon paris edp  sacramento  ...        Floriental    Medium
430            black opium edp  fort wayne  ...   Ambery Gourmand    Strong
431                      y edp    brooklyn  ...  Aromatic Fougere    Strong
432                      y edp  sacramento  ...  Aromatic Fougere    Strong
433                      y edp  sacramento  ...  Aromatic Fougere    Strong

[434 rows x 7 columns]
--using CTE
CREATE OR REPLACE TABLE sql_final AS                                         --First joining intermediary and perfume_sales_by_city
WITH sql_med AS (
  SELECT 
    LOWER(intermediary.ebay_title) AS ebay_title,
    LOWER(intermediary.perfume) AS perfume,
    sql_perfume_sales_by_city.*
  FROM intermediary 
    LEFT JOIN sql_perfume_sales_by_city 
      ON LOWER(intermediary.ebay_title) = LOWER(sql_perfume_sales_by_city.title)
)
SELECT DISTINCT ON (ebay_title)                                            -- Now querying from new sql_med table for second left join
    sql_med.perfume, 
    City, 
    "AQI Value",
    "AQI Category",
    sold,
    perfumes.brand,
    type,
    category,
    CASE WHEN
      longevity = '6–8 hours' THEN 'Strong'
      ELSE longevity
      END 
    AS longevity
  FROM sql_med 
    LEFT JOIN perfumes ON LOWER(sql_med.perfume) = LOWER(perfumes.perfume)
  WHERE sold IS NOT NULL                                                    --getting rid of null values that would result
    AND sql_med.perfume IS NOT NULL                                         --from some perfumes not being sold to certain places
    AND City IS NOT NULL           
    AND perfumes.brand IS NOT NULL
    AND category IS NOT NULL
    AND longevity IS NOT NULL
  ORDER BY ebay_title, City, sold
SELECT * 
FROM sql_final 
Displaying records 1 - 10
perfume City AQI Value AQI Category sold brand type category longevity
polo eau de toilette brooklyn 35 Good 14 Ralph Lauren EDT Woody Aromatic Medium
polo eau de toilette brooklyn 35 Good 5 Ralph Lauren EDT Woody Aromatic Medium
1 million eau de toilette warren 50 Good 2 Paco Rabanne EDT Woody Spicy Medium
1 million eau de toilette hackensack 117 Unhealthy for Sensitive Groups 8877 Paco Rabanne EDT Woody Spicy Medium
1 million lucky hackensack 117 Unhealthy for Sensitive Groups 33 Paco Rabanne Parfum Woody Fruity Strong
1 million eau de toilette warren 50 Good 19 Paco Rabanne EDT Woody Spicy Medium
212 men nyc lincoln park 66 Moderate 207 Carolina Herrera EDT Woody Spicy Medium
212 vip hackensack 117 Unhealthy for Sensitive Groups 3691 Carolina Herrera EDP Woody Spicy Medium
acqua di giò edt thomasville 61 Moderate 111 Giorgio Armani EDT Aquatic Aromatic Medium
acqua di giò edt miami 55 Moderate 235 Giorgio Armani EDT Aquatic Aromatic Medium

Comparing the SQL and Python final tables with anti-joins to see if all entries are the same.

sql_final <- tbl(con, "sql_final") |> collect()

anti_join(sql_final, py$py_final, by = c("perfume", "City")) |>
  select(perfume, City, sold)
anti_join(py$py_final, sql_final, by = c("perfume", "City")) |>
  select(perfume, City, sold)
#fix case on dataframes
r_intermediary <- r_intermediary |>
  mutate(ebay_title = tolower(ebay_title),
         perfume = tolower(perfume))
r_perfumes <- r_perfumes |>
  mutate(perfume = tolower(perfume))

# Left join the perfume_sales_by_city with the perfumes data set

r_final <- r_intermediary |>
           rename(title = ebay_title) |>                                                         # Renaming to match column between tables
           left_join(r_perfume_sales_by_city, by = "title", relationship = "many-to-many") |>    # "many to many" to account for perfumes 
                                                                                                 # being sold in multiple cities, and having
                                                                                                 # multiple attributes
           select(-brand, -match_confidence, -match_quality) |>                                  # Drop columns
           left_join(r_perfumes, by = "perfume", relationship = "many-to-many") |>               # Merge perfumes
           drop_na() |>                                                                          # Drop null values
           arrange(title, City, sold) |>
           distinct(title, .keep_all = TRUE) |>                                                  # Drop duplicates based on title
           select(perfume, City, 'AQI Value', 'AQI Category', sold, category, longevity) |>
           mutate (longevity = recode(longevity, "6–8 hours" = "Strong"))
nrow(r_final)
[1] 434
#Checking to see that datasets have the same rows across all languages
anti_join(r_final, py$py_final, by = join_by(City))
anti_join(py$py_final, r_final, by = join_by(City))

Fixing case on the intermediary table and the perfumes table, then using an inner join to merge pq_intermediary with pq_perfume_sales_by_city.

Expanding the table to include relevant information

Filtering out rows with no value in perfume column

Using inner join to merge the previous inner join with pq_perfumes

Expanding to include relevant columns. Removing duplicates. Verifying the final number of rows. Mutating to replace “6–8 hours” with “Strong.”

Analysis

Goal 1

Grouping by the longevity of perfume, and then aggregating the AQI level, perfumes sold, and number of cities, we are able to connect the strength of a perfume to the AQI level of a city. There appears to be a slight positive association between perfume strength and AQI. The only outlier is “Light,” for which we have only 15 rows of data.

Even though the result may not be statistically significant, the data suggests that residents of cities with more pollution (i.e., “smellier” cities) tend to choose stronger perfumes.

py_analysis_1 = (py_final
                       .groupby('longevity')
                       .agg(avg_AQI = ('AQI Value', 'mean'),
                            total_sold = ('sold', 'sum'),
                            total_cities = ('City', 'count'))
              )
print(py_analysis_1)
               avg_AQI  total_sold  total_cities
longevity                                       
Light        77.800000     62912.0            15
Medium       69.297945    171627.0           292
Strong       70.856000     88935.0           125
Very Strong  90.500000        84.0             2
CREATE OR REPLACE TABLE sql_analysis_1 AS
SELECT 
  longevity,
  AVG("AQI Value") AS avg_AQI,
  SUM(sold) AS total_sold,
  COUNT(DISTINCT City) AS total_cities    --note: counts distinct cities, unlike the row counts in Python/R
FROM sql_final
GROUP BY longevity
ORDER BY longevity ASC 
SELECT *
FROM sql_analysis_1
4 records
longevity avg_AQI total_sold total_cities
Light 77.80000 62912 9
Medium 69.29795 171627 74
Strong 70.85600 88935 51
Very Strong 90.50000 84 2
r_analysis_1 <- r_final |>
    group_by(longevity) |>
      summarise(
          avg_AQI = mean(`AQI Value`),
          total_sold = sum(sold),
          total_cities = n()
      )
print(r_analysis_1)
# A tibble: 4 × 4
  longevity   avg_AQI total_sold total_cities
  <chr>         <dbl>      <dbl>        <int>
1 Light          77.8      62912           15
2 Medium         69.3     171627          292
3 Strong         70.9      88935          125
4 Very Strong    90.5         84            2

Grouping by Longevity

final table sorted

However, after running an ANOVA test to test the statistical significance of these values, we found that there is no significant relationship between the longevity of perfume worn and the AQI values of the city.

H0: Mean AQI values are the same across all longevity categories of perfume.

H1: There is at least one longevity category with a different mean AQI value.

from scipy.stats import f_oneway

# Extract data for each group
light = py_final[py_final['longevity'] == 'Light']['AQI Value']
medium = py_final[py_final['longevity'] == 'Medium']['AQI Value']
strong = py_final[py_final['longevity'] == 'Strong']['AQI Value']
very_strong = py_final[py_final['longevity'] == 'Very Strong']['AQI Value']

# Perform one-way ANOVA
py_anova = f_oneway(light, medium, strong, very_strong)
print(py_anova)
F_onewayResult(statistic=np.float64(0.6937626453656262), pvalue=np.float64(0.5562383465640923))
import matplotlib.pyplot as plt
import seaborn as sns

plt.clf()                                                     #clearing previous boxplots
sns.boxplot(
    data=py_final, 
    x='longevity', 
    y='AQI Value', 
    hue='longevity',                                          #filling in boxes by longevity category
    order=['Light', 'Medium', 'Strong', 'Very Strong'],
    flierprops=dict(markerfacecolor='red')                    # making outlier red
)

plt.title('AQI Value by Longevity Category')
plt.xlabel('Longevity')
plt.ylabel('AQI Value')

plt.show()

r_anova <- 
    aov(`AQI Value` ~ longevity, data = r_final)
summary(r_anova)
             Df Sum Sq Mean Sq F value Pr(>F)
longevity     3   1980   660.1   0.694  0.556
Residuals   430 409139   951.5               
ggplot(r_final, aes(x = longevity, y = `AQI Value`, fill = longevity)) +          #filling in boxes by longevity category
  geom_boxplot(outlier.color = "red") +                                           #coloring outlier red
  theme(legend.position="none") +                                                 #getting rid of legend since it's redundant
  labs(title = "AQI Value by Longevity Category",
       x = "Longevity",
       y = "AQI Value")

Power Query has no built-in ANOVA function, so we prepared the data in Power Query and ran the ANOVA itself in Excel using the Analysis ToolPak.

We started by removing all columns other than longevity and AQI Value, and added an index column to be able to pivot the table wide.

We pivoted the column to make it wide, with columns Light, Medium, Strong, and VeryStrong, then removed the index column. After this, we loaded the data into the Excel file and closed Power Query. We then used the Analysis ToolPak in Excel to run the ANOVA.

After installing the Analysis ToolPak, we selected ANOVA: Single Factor and entered the input range, alpha value, and output cell range to produce the final ANOVA analysis.

Box Plot for the data

The p-value is 0.556. Since p > 0.05, we cannot reject H0.
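As a sanity check on what the ANOVA is computing, the one-way F statistic can be reproduced by hand: it is the between-group mean square divided by the within-group mean square. The small groups below are made-up numbers for illustration, not our AQI data:

```python
def one_way_f(groups):
    """Compute the one-way ANOVA F statistic from a list of groups."""
    all_vals = [x for g in groups for x in g]
    grand_mean = sum(all_vals) / len(all_vals)
    # Between-group sum of squares: group sizes times squared mean offsets
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares: squared deviations from each group's mean
    ss_within = sum(sum((x - sum(g) / len(g)) ** 2 for x in g) for g in groups)
    df_between = len(groups) - 1
    df_within = len(all_vals) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)

# Made-up example groups (not our AQI values)
print(one_way_f([[1, 2, 3], [2, 3, 4], [4, 5, 6]]))  # ≈ 7.0
```

Running f_oneway on the same groups gives the identical statistic, which is how scipy arrives at our F of 0.694 and p of 0.556.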

Goal 2

We see that categories like “Oud”, “Leather”, and “Woody” are more popular in cities with higher average AQI levels. Comparatively, categories like “Fresh” and “Musk” are popular in cities with lower average AQI levels. It seems to us that people favor more “intense” scent notes in cities with higher pollution.

py_analysis_2 = (py_final
                        .assign(category = lambda df_: df_['category'].str.split(' ')) 
#Some categories had multiple notes, we wanted to account for both, so we split them into two rows each
                        .explode('category')                                      
                        .replace({'category':                                             # standardize category names
                                {"Floriental": "Floral Oriental",                         
                                 "Ambery": "Amber",
                                 "Musky": "Musk"}
                                })
                        .assign(category = lambda df_: df_['category'].str.split(' '))    # re-split and re-exploded
                        .explode('category')
                        .groupby('category')
                        .agg(avg_AQI = ('AQI Value', 'mean'),
                            total_sold = ('sold', 'sum'))
                        .query("category != (['Unknown', 'Scent'])")                      # getting rid of vague categories
                        .sort_values("avg_AQI", ascending = False)
              )
print(py_analysis_2)
             avg_AQI  total_sold
category                        
Oud       103.000000       647.0
Leather    86.000000      1246.0
Green      80.666667      7926.0
Chypre     75.000000         3.0
Woody      74.644737    124597.0
Citrus     74.200000     53477.0
Fruity     73.083333     26524.0
Spicy      72.760417     59084.0
Aromatic   72.682171    174268.0
Sweet      71.500000        81.0
Aquatic    69.269231     10549.0
Fougere    68.023256     47809.0
Amber      67.632653     26431.0
Floral     67.484076     74779.0
Tropical   65.000000        17.0
Oriental   63.360825     33435.0
Gourmand   62.133333      1653.0
Fresh      49.500000      1008.0
Musk       42.200000       156.0
CREATE OR REPLACE TABLE sql_analysis_2 AS
-- CTE: split multi-note categories into one row per note
WITH split_category AS (  
  SELECT 
      UNNEST(STRING_SPLIT(category, ' ')) AS category,
      "AQI Value",
      sold
    FROM sql_final
),
recode AS (
  SELECT 
    UNNEST(STRING_SPLIT(
      CASE 
        WHEN category = 'Floriental' THEN 'Floral Oriental'
        WHEN category = 'Ambery' THEN 'Amber'
        WHEN category = 'Musky' THEN 'Musk'
        ELSE category
      END,
      ' '
    )) AS category,
    "AQI Value",
    sold
  FROM split_category
)
SELECT 
  category,
  AVG("AQI Value") AS avg_AQI,
  SUM(sold) AS total_sold
FROM recode
WHERE category NOT IN ('Unknown', 'Scent')
GROUP BY category
ORDER BY avg_AQI DESC;
SELECT *
FROM sql_analysis_2
Displaying records 1 - 10
category avg_AQI total_sold
Oud 103.00000 647
Leather 86.00000 1246
Green 80.66667 7926
Chypre 75.00000 3
Woody 74.64474 124597
Citrus 74.20000 53477
Fruity 73.08333 26524
Spicy 72.76042 59084
Aromatic 72.68217 174268
Sweet 71.50000 81
r_analysis_2 <- r_final |>
  # split multi-note categories into one row per note
  mutate(category = str_split(category, " ")) |>
  unnest(category) |>
  # recode inconsistent category names
  mutate(category = recode(category,
                           "Floriental" = "Floral Oriental",
                           "Ambery" = "Amber",
                           "Musky" = "Musk")) |>
  # re-split and re-unnest, since "Floral Oriental" is two notes
  mutate(category = str_split(category, " ")) |>
  unnest(category) |>
  group_by(category) |>
  summarise(
    avg_AQI = mean(`AQI Value`),
    total_sold = sum(sold)
  ) |>
  filter(!category %in% c("Unknown", "Scent")) |>
  arrange(desc(avg_AQI))

print(r_analysis_2)
# A tibble: 19 × 3
   category avg_AQI total_sold
   <chr>      <dbl>      <dbl>
 1 Oud        103          647
 2 Leather     86         1246
 3 Green       80.7       7926
 4 Chypre      75            3
 5 Woody       74.6     124597
 6 Citrus      74.2      53477
 7 Fruity      73.1      26524
 8 Spicy       72.8      59084
 9 Aromatic    72.7     174268
10 Sweet       71.5         81
11 Aquatic     69.3      10549
12 Fougere     68.0      47809
13 Amber       67.6      26431
14 Floral      67.5      74779
15 Tropical    65           17
16 Oriental    63.4      33435
17 Gourmand    62.1       1653
18 Fresh       49.5       1008
19 Musk        42.2        156

Started by removing all columns that were not needed

Replaced “Floriental” with “Floral Oriental”

Split the category column by the space delimiter

Replaced “Ambery” with “Amber”

Replaced “Musky” with “Musk”

Removed rows with the values “Unknown” and “Scent”

Grouped by category and added a column for average AQI

At first glance, there does not appear to be a statistically meaningful relationship, so we will not move forward with an ANOVA test for this analysis.

Conclusion

There was no statistical significance in either of our analyses, so we cannot draw conclusions regarding:

  1. whether longer-lasting perfumes are preferred in cities with higher AQIs; and

  2. whether perfumes with strong notes are preferred in higher-AQI cities.

Limitations

The statistical power of this analysis is low given that our dataset was relatively small (only 435 final rows); in particular, only 84 units of “Very Strong” perfume were recorded as sold, across only 2 cities. We also must acknowledge that the eCommerce data’s “itemLocation” field was not clear about whether it referred to warehouse locations where perfumes were stored or to where the customers were located; we chose to interpret it as the latter. There were also locations that may have been listed as states (it was not clear) but were counted as cities when we joined the tables. For example, while there is a state named “Delaware,” there is also a city of the same name in Ohio. The global pollution data did not provide corresponding states, only the country. Additionally, the global dataset did not have a date associated with it but was uploaded 3 years ago, so we assume it is relatively up to date and can still be used in conjunction with the eCommerce data from 2024.

There may also have been relevant data that we did not account for and that was dropped during the AI-assisted creation of the intermediary table. The datasets were also not fully comprehensive: some cities were not listed, and some popular perfume brands were not named.

Challenge Resolution

Our primary challenge in this project was data cleaning and integration across multiple datasets. Specifically, there was no straightforward way to exactly match perfume titles across the different title columns using the techniques covered in class. The same perfume often appeared under slightly different names in the e-commerce dataset, which made direct matching difficult and pushed our work beyond standard data wrangling. To address this, we used Claude and ChatGPT to assist in matching the listing titles to perfumes. The link to the prompts and iterative process of creating the intermediary table with Claude can be found here.
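For comparison, a purely programmatic alternative to the AI-assisted matching (not the approach we actually used) would be fuzzy string matching. The sketch below uses Python's standard-library difflib, with hypothetical titles invented for illustration, to pair each listing with its closest reference name above a similarity cutoff.

```python
from difflib import SequenceMatcher

# Hypothetical titles for illustration; not drawn from our datasets.
reference = ["Chanel No. 5 Eau de Parfum", "Dior Sauvage Eau de Toilette"]
listings = ["CHANEL No 5 eau de parfum 3.4oz",
            "Dior Sauvage eau de toilette 100ml spray"]

def best_match(title, candidates, cutoff=0.5):
    """Return the candidate most similar to title, or None if below cutoff."""
    scored = [(SequenceMatcher(None, title.lower(), c.lower()).ratio(), c)
              for c in candidates]
    score, match = max(scored)
    return match if score >= cutoff else None

for title in listings:
    print(title, "->", best_match(title, reference))
```

A cutoff like 0.5 is a tunable assumption; too low and unrelated listings get matched, too high and legitimate variants are dropped, so any such approach still needs spot-checking.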

Another major challenge involved differences in case sensitivity across analytic frameworks. During the analysis phase, we discovered discrepancies in the average AQI values across frameworks when grouped by longevity and category, despite having exactly the same row counts in the final merged dataset. This revealed that inconsistent text casing was affecting how values were grouped. To resolve this, we converted all relevant text columns to lowercase to ensure consistent matching and equivalent treatment across all frameworks.
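A minimal illustration of the issue, using made-up values rather than our actual data: a case-sensitive group-by treats “Woody” and “woody” as distinct categories until the text is normalized.

```python
import pandas as pd

# Made-up values illustrating the case-sensitivity issue.
df = pd.DataFrame({
    "category": ["Woody", "woody", "Citrus"],
    "AQI Value": [80, 60, 70],
})

# Without normalization, "Woody" and "woody" form separate groups.
raw = df.groupby("category")["AQI Value"].mean()
print(len(raw))        # 3 groups

# Lowercasing first merges them into a single "woody" group.
df["category"] = df["category"].str.lower()
clean = df.groupby("category")["AQI Value"].mean()
print(len(clean))      # 2 groups
print(clean["woody"])  # 70.0, the mean of 80 and 60
```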

Takeaways

In Python, we were not as familiar with performing statistical tests and prompted Claude and Google AI for help. We discovered that we would need to install the “scipy” package into our Python environment alongside pandas (this is also how we learned about using the “matplotlib” and “seaborn” packages to create a customized boxplot). While not as intuitive as in RStudio, we were able to conduct the ANOVA test by separating the AQI Value data into the longevity categories and running the ANOVA command (f_oneway) across all of them.
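A minimal sketch of that workflow, using an invented toy dataset and assumed column names `longevity` and `AQI Value` (not our real table):

```python
import pandas as pd
from scipy.stats import f_oneway

# Toy data with the assumed column names; our real table is much larger.
df = pd.DataFrame({
    "longevity": ["weak", "weak", "moderate", "moderate",
                  "strong", "strong", "very strong", "very strong"],
    "AQI Value": [55, 60, 70, 72, 68, 75, 80, 78],
})

# Separate the AQI values into one array per longevity category...
groups = [g["AQI Value"].to_numpy() for _, g in df.groupby("longevity")]

# ...and run the one-way ANOVA across all groups at once.
f_stat, p_value = f_oneway(*groups)
print(f"F = {f_stat:.3f}, p = {p_value:.3f}")
```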

For SQL, we were unable to perform ANOVA directly, as the framework does not provide built-in support for this type of statistical analysis.

In RStudio, ANOVA functionality is built in, allowing us to carry out the statistical analysis without relying on any additional toolkits. We also used the “ggplot2” package and practice covered previously in the course through DataCamp to create a simple boxplot, as well as the STHDA guide to boxplots for further customization.

For Power Query, one team member revisited concepts from the Quantifying UX course taught by Dr. Jacek Gwizdka. This course covered hypothesis testing for categorical data using ANOVA and the Analysis ToolPak in Microsoft Excel. We revisited previous assignments completed for the course to familiarize ourselves with the process to conduct the test. We used pivot columns in Power Query to widen the data by longevity and then applied the ANOVA test across the four longevity categories using Data Analysis feature from Analysis ToolPak. Finally, we used Excel’s built-in chart features to visualize the results.